A Comprehensive Whole Genome Bacterial Phylogeny Using Correlated Peptide Motifs Defined in a High Dimensional Vector Space
Abstract
As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
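The pipeline described in the abstract (tetrapeptide frequency vectors, SVD, summed species vectors, cosines between species vectors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy sequences, the species labels, and the choice of rank are all assumptions made for the example, and a dense NumPy SVD stands in for whatever sparse solver a dataset of real size would require. The abstract does not describe any weighting or normalization of the counts, so none is applied here.

```python
import numpy as np

def tetrapeptide_counts(seq, k=4):
    """Count overlapping k-mers (k=4 gives tetrapeptides) in one protein."""
    counts = {}
    for i in range(len(seq) - k + 1):
        pep = seq[i:i + k]
        counts[pep] = counts.get(pep, 0) + 1
    return counts

def build_matrix(proteins, k=4):
    """Peptide-by-protein count matrix: one row per observed peptide,
    one column per protein."""
    per_protein = [tetrapeptide_counts(s, k) for s in proteins]
    peptides = sorted(set().union(*per_protein))
    row = {p: i for i, p in enumerate(peptides)}
    A = np.zeros((len(peptides), len(proteins)))
    for j, counts in enumerate(per_protein):
        for pep, c in counts.items():
            A[row[pep], j] = c
    return A, peptides

def protein_vectors(A, rank):
    """Reduced-dimension protein vectors from a truncated SVD.
    Columns of U (left vectors) span peptides; rows of Vt (right
    vectors) span proteins."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (s[:rank, None] * Vt[:rank]).T  # shape: (n_proteins, rank)

def species_vectors(pvecs, species_of):
    """Sum the protein vectors belonging to each species."""
    names = sorted(set(species_of))
    rows = []
    for name in names:
        idx = [i for i, s in enumerate(species_of) if s == name]
        rows.append(pvecs[idx].sum(axis=0))
    return names, np.array(rows)

def cosine_matrix(X):
    """Pairwise cosines of the angles between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

# Toy data (hypothetical sequences): species A and B share most
# tetrapeptides, species C shares none with either.
proteins = [
    "MKTAYIAKQR", "GAVLIPFWMG",   # species A
    "MKTAYIAKQL", "GAVLIPFWMA",   # species B
    "DDEEDDEEDD", "HHKKRRHHKK",   # species C
]
species_of = ["A", "A", "B", "B", "C", "C"]

A_mat, peptides = build_matrix(proteins)
pvecs = protein_vectors(A_mat, rank=4)   # rank chosen arbitrarily for the toy case
names, svecs = species_vectors(pvecs, species_of)
sim = cosine_matrix(svecs)

print(names)
print(np.round(sim, 3))
```

A distance such as `1 - sim` could then be passed to any standard tree-building routine; the abstract does not say which one the authors used, so that step is left out.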
Related resources
A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes.
We recently developed a method for producing comprehensive gene and species phylogenies from unaligned whole genome data using singular value decomposition (SVD) to analyze character string frequencies. This work provides an integrated gene and species phylogeny for 64 vertebrate mitochondrial genomes composed of 832 total proteins. In addition, to provide a theoretical basis for the method, we...
Whole-Genome Sequencing of a Clinically Isolated Antibiotic-Resistant Enterococcus faecium EntfacYE
Background and Objective: Enterococcal infections are considered the most common nosocomial infections. Nowadays, enterococci show high resistance to common antibiotics, especially vancomycin. Vancomycin-resistant Enterococcus faecium is one of the most common nosocomial infections, which is included in the World Health Organization priority pathogens list for research and development of new an...
Improving Phylogeny Reconstruction at the Strain Level Using Peptidome Datasets
Typical bacterial strain differentiation methods are often challenged by high genetic similarity between strains. To address this problem, we introduce a novel in silico peptide fingerprinting method based on conventional wet-lab protocols that enables the identification of potential strain-specific peptides. These can be further investigated using in vitro approaches, laying a foundation for t...
Enhanced Solubility of Anti-HER2 scFv Using Bacterial Pelb Leader Sequence
A single-chain variable fragment (scFv) is an antibody fragment consisting of the variable regions of the heavy and light chains. scFvs penetrate tissues more readily while maintaining specific affinity and low immunogenicity. Insoluble inclusion bodies are formed when scFvs are expressed in the reducing bacterial cytoplasm. One strategy for obtaining functionally active scFv is to translocate ...
A Whole Genome Phylogeny Using Truncated Pivoted QR Decomposition
The increasing availability of whole genome sequences in public databases has stimulated the development of new methods to automatically compare and categorize genes and species. Recently developed methods based on the singular value decomposition (SVD) allow for the simultaneous identification and definition of well-conserved motifs and gene families using very large whole genome datasets. In ...
Journal:
- Journal of Bioinformatics and Computational Biology
Volume 1, Issue 3
Pages: -
Publication date: 2003